4-5/5/2021

1. Data Analysis

Data Analysis in the Scientific Cycle

Data-Intensive Research

  • Science and humanities are increasingly data-driven
    • Early-career training has not prepared all researchers for this

Research Workflows

  • Enable systematic, replicable and reproducible work
    • Design principles
      • Best practices for data
    • Software development methods
      • Automation of repetitive calculations

Pipelines and Workflows

Pipeline

  • What a computer does
    • A series of instructions
    • Data is piped through programs, and a result emerges
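A pipeline can be sketched as a few R statements where the output of each step feeds the next. The readings below are made up for illustration, and the variable names are hypothetical.

```r
# Hypothetical measurements; NA marks a missing reading.
readings <- c(4.2, NA, 5.1, 3.8)

# Step 1: drop missing values.
cleaned <- readings[!is.na(readings)]

# Step 2: rescale so the largest reading becomes 1.
scaled <- cleaned / max(cleaned)

# Step 3: summarise the rescaled readings as a single number.
result <- mean(scaled)
result
```

Each step depends only on the previous one, so the whole sequence can be rerun unchanged on new data.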

Workflow

  • What a researcher does
    • Exploring data, developing hypotheses, writing code, interpreting results
  • Outputs include:
    • datasets, methods, teaching materials, software, papers, etc.

Explore, Refine, Produce (ERP)

2. Welcome to R

Learning Objectives

  • Fundamentals of R and RStudio
  • Fundamentals of programming (in R)
  • Data management with the tidyverse
  • Publication-quality data visualisation with ggplot2
  • Reporting with RMarkdown

What is R?

  • R is:
    • a programming language
    • the software that interprets/runs programs written in the R language

Why use R?

  • free (though commercial support can be bought)
  • widely used
    • sciences, humanities, engineering, statistics, etc.
  • has many excellent specialised packages for data analysis and visualisation
  • international, friendly user community

What is RStudio?

Please start RStudio

  • RStudio is an integrated development environment (IDE)
  • Script/code editor; Project management
  • Interaction with R (console/‘scratchpad’); Graphics/visualisation/Help

“Why not use Excel?”

  • Excel is good for some things
  • R is excellent for analysis and reproducibility…
  • Separates data from analysis
  • Not point-and-click: every step is explicit and transparent
  • Easy to share, adapt, reuse and publish analyses (e.g. on GitHub), and to rerun them with new/modified data
  • R can be run on supercomputers, with extremely large datasets…

RStudio overview - INTERACTIVE DEMO

Variables

Variables are like named boxes

  • An item (object) of data goes in the box (which is called Name)
  • When we refer to the box (variable) by its name, we really mean what’s in the box

Variables - Interactive Demo

x <- 1 / 40
x
## [1] 0.025
x ^ 2
## [1] 0.000625
log(x)
## [1] -3.688879
name <- "Samia"
name
## [1] "Samia"
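A short sketch of what happens when the contents of a box are replaced. It reuses x from the demo; y is an extra variable introduced here for illustration.

```r
x <- 1 / 40   # the box called x now holds 0.025
y <- x * 2    # uses the current contents of x, so y holds 0.05
x <- 10       # replaces the contents of x; y is not recalculated
x
y
```

Assigning to x again does not change y: y keeps the value that was calculated when its own assignment ran.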

Naming Variables

Variable names are documentation

current_temperature <- 28.6
subjectID <- "GCF_00001236452.1"
GPS_Location <- "54N, 36E"
  • descriptive, but not too long
  • letters, numbers, underscores, and periods ([a-zA-Z0-9_.])
  • cannot contain whitespace or start with a number or underscore (x2 is allowed, 2x is not)
  • case sensitive (Weight is not the same as weight)
  • do not reuse names of built-in functions
  • Consistent style:
    • lower_snake, UPPER_SNAKE, lowerCamelCase, UpperCamelCase
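One way to try these rules out, as a sketch: base R's make.names() rewrites a string into a valid name, so a candidate that comes back unchanged was already valid. The candidate names are made up for illustration.

```r
# Candidate names to check (illustrative only).
candidates <- c("x2", "2x", "current temperature",
                "current_temperature", "_mass")

# make.names() returns a syntactically valid version of each string;
# if nothing had to change, the original was already a valid name.
is_valid <- make.names(candidates) == candidates
data.frame(name = candidates, valid = is_valid)

# Names are case sensitive: these are two different variables.
Weight <- 70
weight <- 65
Weight == weight
```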

Functions

Functions (log(), sin() etc.) ≈ “canned script”

  • automate complicated tasks
  • make code more readable and reusable
  • Functions usually take arguments (input)
  • Functions often return values (output)
  • Some functions are built-in (in base packages, e.g. sqrt(), lm(), plot())
  • Groups of related functions are distributed as packages, which are loaded with library()
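A minimal sketch of calling built-in functions and writing one of your own. fahr_to_celsius() is a hypothetical example function, not part of R.

```r
# Built-in functions: arguments go in, a value comes back.
root <- sqrt(16)                  # one argument
log_ten <- log(100, base = 10)    # second argument passed by name

# A user-defined function: temp_f is the argument (input),
# and the value of the last expression is returned (output).
fahr_to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}

boiling <- fahr_to_celsius(212)
boiling
```

Once defined, the function can be reused wherever the conversion is needed, instead of retyping the formula.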

Getting Help in R

INTERACTIVE DEMO

args(fname)            # arguments for fname
?fname                 # help page for fname
help(fname)            # help page for fname
??fname                # any mention of fname
help.search("text")    # any mention of "text"
vignette("topic")      # long-form guide (vignette) on "topic"
vignette()             # show all available vignettes

Challenge 01 (1min)

What will be the value of each variable after each statement in the following program?

mass <- 47.5
age <- 122
mass <- mass * 2.3
age <- age - 20
  • mass = 47.5, age = 102
  • mass = 109.25, age = 102
  • mass = 47.5, age = 122
  • mass = 109.25, age = 122

USE CHALLENGE LINK ON ETHERPAD

3. Project Management in R

How Projects Tend To Grow

Good Practice

THERE IS NO ONE TRUE WAY (only principles)

  • Use a single working directory per project/analysis
    • easier to move, share, and find files
    • use relative paths to locate files
  • Treat raw data as read-only
    • keep it in a separate subfolder (e.g. data/)
  • Clean data programmatically, so it is ready for analysis
    • keep cleaned/modified data in a separate folder (e.g. clean_data/)
  • Consider output generated by analysis to be disposable
    • can be regenerated by running analysis/code
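These principles can be sketched in a few lines of R. The folder names, file names and data values are all hypothetical, and the project is created under tempdir() only so the sketch runs anywhere; in a real project the paths would be relative to the project's working directory (e.g. "data/measurements.csv").

```r
# Hypothetical project root (a temporary folder, for illustration).
project_dir <- file.path(tempdir(), "example_project")
for (folder in c("data", "clean_data", "results")) {
  dir.create(file.path(project_dir, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Stand-in for raw data that would normally be downloaded or collected.
raw_path <- file.path(project_dir, "data", "measurements.csv")
write.csv(data.frame(id = 1:4, value = c(2.5, NA, 3.1, 4.0)),
          raw_path, row.names = FALSE)

# Read the raw file, clean it in code, and save the result elsewhere.
# The raw file itself is never modified.
raw <- read.csv(raw_path)
clean <- raw[!is.na(raw$value), ]
clean_path <- file.path(project_dir, "clean_data", "measurements_clean.csv")
write.csv(clean, clean_path, row.names = FALSE)
```

Because the cleaning step is code, deleting clean_data/ loses nothing: rerunning the script regenerates it from the raw data.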

Example Directory Structure

Project Management in RStudio

Working in RStudio

We can write code in several ways in RStudio

  • At the console (you’ve done this)
  • In a script
  • As an interactive notebook
  • As a markdown file
  • As a Shiny app

We’re going to create a new dataset and R script.

  • Putting code in a script makes it easier to modify, share and run
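As a rough sketch of what running a script means: the code below writes a two-line script to a temporary file and runs it with source(). The file name and its contents are hypothetical; in RStudio the script would normally be created in the editor and saved in the project folder.

```r
# Write a tiny script to a temporary file (illustrative only).
script_path <- file.path(tempdir(), "first_script.R")
writeLines(c("radius <- 2",
             "area <- pi * radius^2"),
           script_path)

# source() runs every line of the script, in order.
source(script_path)
area
```

The same file can be shared with a colleague or rerun later, and it gives the same result each time.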

INTERACTIVE DEMO

4. A First Analysis in RStudio

Our Task

  • Patients have been given a new treatment for arthritis
  • We have measurements of inflammation over a period of days for each patient
  • We want to produce a preliminary analysis and graphs for this data
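A sketch of the kind of summary this task calls for, assuming the data is laid out with one row per patient and one column per day. The numbers below are invented for illustration and are not the course dataset.

```r
# Invented inflammation scores: 3 patients (rows) over 5 days (columns).
inflammation <- matrix(
  c(0, 1, 3, 2, 1,
    0, 2, 4, 3, 1,
    0, 1, 2, 4, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("patient_", 1:3), paste0("day_", 1:5))
)

# Average inflammation across patients, for each day.
day_means <- colMeans(inflammation)

# Highest inflammation recorded for each patient.
patient_max <- apply(inflammation, 1, max)

day_means
patient_max
```

A call such as plot(day_means) would then give a first graph of how average inflammation changes over the days.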

Download the file from the following link to your data/ directory, and extract it

(the link is also available on the course Etherpad page)